Morphological Tags in Parallel Corpora

Author

  • Alexandr Rosen
Abstract

Multilingual parallel corpora can be annotated with morphosyntactic tags by monolingual tools, which are freely available for a number of languages. However, each tool is typically bundled with a specific tagset and assumes a specific way of tokenization. The variety of tagging schemes and tag formats may be a problem for the user: a relatively simple tag query in a multilingual setting often means spending a while with tagset manuals. The aim of the present contribution is to suggest a solution that would delegate the task of dealing with multiple tagsets to the system. The core component of the proposal can be viewed as an abstract interlingual tagset. It is actually a hierarchy of linguistic categories, partially ordered by their specificity and mapped to tags in language-specific tagsets. In order to capture the different views of word classes taken by the tagsets, the common tagset adopts three perspectives on word class: lexical, inflectional and syntactic, each potentially coupled with its own set of morphological categories. Thus, the tag for the Czech relative pronoun který ‘which’ is decoded as a category with the properties of lexical pronoun, inflectional adjective and syntactic noun, each with its appropriate morphological characteristics. The common tagset is formalised as a tangled hierarchy of types, each type corresponding to a linguistic category and some of the types to one or more language-specific tags. Tags in all tagsets can be described as objects with properties such as lexical, inflectional and syntactic word class, and the relevant morphological categories. The standard methods of Formal Concept Analysis (Ganter & Wille, 1999) can then be used to construct the hierarchy automatically as a concept lattice and to (partially) resolve tag queries that do not quite match the tags used for the specific language, in a way similar to that used by Janssen (2004) for dealing with lexical gaps in a multilingual lexical database.
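The construction of a concept lattice from tags described as objects with properties can be illustrated with a minimal sketch. It is not the author's implementation: the tag names and property labels in `CONTEXT` are invented for illustration, and the enumeration is a naive one that only suits toy contexts.

```python
from itertools import combinations

# Hypothetical formal context: objects are tags, attributes are properties.
# Tag names and property labels are invented for illustration only.
CONTEXT = {
    "cs:relative_pronoun": {"lex=pronoun", "infl=adjective", "syn=noun"},
    "cs:adjective": {"lex=adjective", "infl=adjective", "syn=adjective"},
    "en:noun": {"lex=noun", "infl=noun", "syn=noun"},
}


def intent(objs, context):
    """Attributes shared by all given objects (all attributes if objs is empty)."""
    shared = set().union(*context.values())
    for obj in objs:
        shared &= context[obj]
    return frozenset(shared)


def extent(attrs, context):
    """Objects that carry every one of the given attributes."""
    return frozenset(obj for obj, own in context.items() if attrs <= own)


def concepts(context):
    """Naively enumerate all formal concepts as (extent, intent) pairs.

    Every subset of objects is closed; this is exponential in the number
    of objects and meant for small examples only.
    """
    found = set()
    objs = list(context)
    for size in range(len(objs) + 1):
        for subset in combinations(objs, size):
            attrs = intent(subset, context)
            found.add((extent(attrs, context), attrs))
    return found


def is_subconcept(lower, upper):
    """Lattice order: a concept is more specific if its extent is smaller."""
    return lower[0] <= upper[0]


if __name__ == "__main__":
    for ext, att in sorted(concepts(CONTEXT), key=lambda c: (len(c[0]), sorted(c[1]))):
        print(sorted(ext), sorted(att))
```

In this toy context, the concept whose intent is `infl=adjective` groups the relative pronoun with the adjective, while the concept whose intent is `syn=noun` groups it with the noun, which is the kind of tangled hierarchy described above.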
Language-specific subsets of the abstract common tagset can be extracted using the links to tags in language-specific tagsets. Abstract language-specific tagsets can be used to generate or interpret tags in a format of the user’s or a tool’s preference. In addition, the modular setup allows for underspecified tag queries and for mappings between tagsets.
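A hedged sketch of how underspecified queries and tagset mappings might work once tags are decoded into properties: the tagsets, tag names and the `resolve` and `map_tag` helpers below are all hypothetical, and a dimension that a tagset does not encode is simply treated as compatible with any value, which is only one possible reading of partial resolution.

```python
# Hypothetical tagsets: every tag name and property value is invented.
# The "en" tagset deliberately lacks the inflectional dimension.
TAGSETS = {
    "cs": {
        "CS_RELPRON": {"lex": "pronoun", "infl": "adjective", "syn": "noun"},
        "CS_ADJ": {"lex": "adjective", "infl": "adjective", "syn": "adjective"},
        "CS_NOUN": {"lex": "noun", "infl": "noun", "syn": "noun"},
    },
    "en": {
        "EN_WHPRON": {"lex": "pronoun", "syn": "noun"},
        "EN_ADJ": {"lex": "adjective", "syn": "adjective"},
        "EN_NOUN": {"lex": "noun", "syn": "noun"},
    },
}


def resolve(query, lang):
    """Return the tags of `lang` compatible with an underspecified query.

    A tag matches if it agrees with the query on every dimension it
    encodes; dimensions missing from the tag are ignored (assumption).
    """
    return sorted(
        name
        for name, props in TAGSETS[lang].items()
        if all(props.get(dim, value) == value for dim, value in query.items())
    )


def map_tag(tag, src, dst):
    """Map a tag of one tagset to compatible tags of another via its properties."""
    return resolve(TAGSETS[src][tag], dst)


if __name__ == "__main__":
    print(resolve({"syn": "noun"}, "cs"))
    print(resolve({"lex": "pronoun", "infl": "adjective"}, "en"))
    print(map_tag("CS_RELPRON", "cs", "en"))
```

A query that names only the syntactic word class returns every tag behaving syntactically as a noun, and a query using a dimension the target tagset lacks still resolves to the closest available tag rather than failing.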

Similar resources

Disambiguation of Single Noun Translations Extracted from Bilingual Comparable Corpora

s of papers of four academic societies, namely Japan Architecture Society (JAS), Institute of Electric Engineering (IEE), Institute of Electronics and Communication Engineering (IECE), and Information Processing Society of Japan (IPSJ), published in Japan. Numbers of abstracts of each of these corpora are shown in Table 1. Parts of these bilingual corpora are parallel. The percentages of parall...

Extracting Parallel Corpora from Comparable Documents to Improve Translation Quality in Machine Translation Systems

Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most of existing methods for exploiting comparable corpora loo...

Arabic Entity Graph Extraction Using Morphology, Finite State Machines, and Graph Transformations

Research on automatic recognition of named entities from Arabic text uses techniques that work well for the Latin based languages such as local grammars, statistical learning models, pattern matching, and rule-based techniques. These techniques boost their results by using application specific corpora, parallel language corpora, and morphological stemming analysis. We propose a method for extra...

Гармонизация Систем Помет Для Многоязычных Корпусов Посредством Решетки Понятий / Harmonizing Tagsets for Multilingual Corpora via Concept Lattice

Multilingual corpora can be annotated with morphosyntactic tags by monolingual tools. However, each of the tools is typically bundled with a specific tagset. This variety of tagging schemes may be a problem for the user: InterCorp, a parallel corpus, currently offers on-line concordances in 22 languages, 11 of them tagged with 11 different tagsets. Fig. 1 illustrates the tagset variety using c...

O. Scrivner, T. Gilmanov SWIFT ALIGNER: A TOOL FOR THE VISUALIZATION AND CORRECTION OF WORD ALIGNMENT AND FOR CROSS LANGUAGE TRANSFER

It is well known that parallel corpora are valuable linguistic resources. One of the benefits of such corpora is that they allow for building an annotated corpus for resource-poor languages via cross-language transfer. That is, given accurate alignment between a word from a source language and its equivalent in a target language, some linguistic information, such as part-of-speech tags or sy...

Publication date: 2010